https://frederik-n-h-lundgren.github.io/
The workload was divided equally
https://github.com/Frederik-N-H-Lundgren/frederik-n-h-lundgren.github.io.git
import pandas as pd
import gzip
import json
from tqdm import tqdm
import networkx as nx
import netwulf
import numpy as np
import ast
import statistics
from itertools import chain
import math
import community
from collections import defaultdict
import matplotlib.pyplot as plt
from nltk.tokenize import MWETokenizer
from nltk.tokenize import word_tokenize
import re
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import nltk
from collections import Counter
from nltk.util import bigrams
from scipy.stats import chi2
from wordcloud import WordCloud
Our dataset is a subset of the ready-made, large-scale Amazon Reviews dataset, collected in 2018 by Jianmo Ni, UCSD. The dataset was directly available for download at: https://nijianmo.github.io/amazon/index.html ("Justifying recommendations using distantly-labeled reviews and fine-grained aspects", Jianmo Ni, Jiacheng Li, Julian McAuley, EMNLP 2019).
The subset we have chosen is the Pet Supplies category. The dataset contains two files: one with the reviews of the products within the category and one with metadata for the products. For the reviews we have used the 5-core dataset, which means there are at least 5 reviews for each product.
All of Amazon's data is huge, so we concentrate on just one group. Pets are cute, and we love our own dog and buy him a lot of toys and snacks, so we have chosen the pet category.
We chose the 5-core reviews because the full review set was very large and too computationally heavy to work with. The 5-core subset focuses on the reviews considered most informative or relevant, which saves computational resources while still providing valuable insights.
Our goal is to identify what factors influence the co-purchasing patterns of pet supply customers, and how these factors affect their product reviews. We also want to find out whether price has an impact on reviews.
The original data contains many more attributes than those we have chosen to work with in this notebook, so we deleted those columns and saved new files to work with. (This notebook does not contain the original data.)
The 5-core reviews were still too big to process, so we decided to only use reviews from after 2017. This means we are no longer guaranteed to have 5 reviews for each product; instead we have a subset of the reviews from 2017-2018. We checked whether the two dataframes contain duplicates and deleted those. Each product in the metadata was checked to be included in the reviews, meaning it has at least one review, and was otherwise deleted, so that all products we work with have at least one review. We also checked the other way around, that each review's product id is in the metadata.
Later on, when we work with the price, the data will be cleaned once more to drop products that don't have a price.
After cleaning, the metadata has 6 attributes and 31696 items: category, description, title, also_buy, price, asin (product id).
The review data has 5 attributes and 496672 reviews: asin, reviewText, overall (rating), reviewTime, unixReviewTime.
The data loaded in is a subset of the downloaded data, as we only kept the columns we think we will need; this reduces the runtime each time we need to restart the notebook.
dfmeta = pd.read_csv('dfmeta.txt', sep='\t')
dfmeta['also_buy'] = dfmeta['also_buy'].apply(ast.literal_eval)
dfmeta['category'] = dfmeta['category'].apply(ast.literal_eval)
dfmeta = dfmeta.drop_duplicates(subset='asin')
df_review = pd.read_csv('df_review.txt', sep='\t')
df_review['reviewTime'] = pd.to_datetime(df_review['reviewTime'], format='%m %d, %Y')
# Keep only reviews made in or after 2017
df_review = df_review[df_review['reviewTime'].dt.year >= 2017]
# Drop duplicate review texts
df_review = df_review.drop_duplicates(subset='reviewText')
# Reset the index of the DataFrame
df_review.reset_index(drop=True, inplace=True)
# df_review now contains the DataFrame with reviews made in or after 2017
review_asin_values = df_review['asin'].unique()
# Keep only metadata rows whose 'asin' appears in df_review
dfmeta = dfmeta[dfmeta['asin'].isin(review_asin_values)]
dfmeta.reset_index(drop=True, inplace=True)
# Filter df_review based on whether 'asin' values are present in dfmeta
df_review = df_review[df_review['asin'].isin(dfmeta['asin'])]
dfmeta
| category | description | title | also_buy | price | asin | |
|---|---|---|---|---|---|---|
| 0 | [Pet Supplies, Top Selection from AmazonPets] | ['Volume 1: 96 Words & Phrases! This is th... | Pet Media Feathered Phonics The Easy Way To Te... | [B0002FP328, B0002FP32S, B0002FP32I, B00CAMARX... | $6.97 | 0972585419 |
| 1 | [Pet Supplies, Dogs, Health Supplies] | ["Our Dog Whisperer with Cesar Milan Complete ... | Dog Whisperer With Cesar Millan: Season 1 | [B000QXDFSA, B0018BD9DK, B002RJ8YDM, B002UJIY3... | NaN | 1417084871 |
| 2 | [Pet Supplies, Dogs, Treats] | ['"You won\'t want to miss this one from Paris... | The Healthy Hound Cookbook: Over 125 Easy Reci... | [1617690554, 1449409938, 1604334657, 163220674... | $14.75 | 1440572828 |
| 3 | [Pet Supplies, Dogs, Health Supplies, Hip &... | ['Dr. Rexy hemp oil has powerful anti-inflamma... | DR.REXY Hemp Oil for Dogs and Cats - 100% Orga... | [] | $19.90 | 1612231977 |
| 4 | [Pet Supplies, Dogs] | ['At last! A comprehensive, holistic guide for... | Natural Cures for Your Dog & Cat | [] | $19.91 | 1882330919 |
| ... | ... | ... | ... | ... | ... | ... |
| 31691 | [Pet Supplies, Dogs, Collars, Harnesses & Leas... | ['Full Grip Supply Camo E-Bungee Collar is a r... | Full Grip Supply Camo E-Bungee Collar for Educ... | [B01KAX8QIO, B00W18D3F8, B005CXJ2OA, B005MJ65Z... | $16.99 | B01HIIJ4US |
| 31692 | [Pet Supplies, Cats, Flea & Tick Control, Flea... | ['Kills fleas, flea eggs, flea larvae and cont... | Sergeants Pet Care Prod 03282 Cat Flea/Tick Co... | [] | $3.34 | B01HIJGHOS |
| 31693 | [Pet Supplies, Cats, Flea & Tick Control, Flea... | ['Premium Quality Flea Comb for Dogs, Cats and... | #1 Pet Flea Comb For Dogs And Cats By Pet's Mu... | [B00JUQVR7U] | NaN | B01HIPJRBM |
| 31694 | [Pet Supplies, Dogs, Health Supplies, Suppleme... | ['Advita for Dogs is a blend of multiple probi... | VetOne Advita Probiotic Nutritional Supplement... | [B00DCV5E28, B078Y63641, B006CBD7LK, B077GHNQG... | $17.37 | B01HIQ9NGU |
| 31695 | [] | ['Latex Dog Toy Prepacks are creative combinat... | Zanies small latex dog toy with squeaker Pack ... | [] | $17.99 | B01HIV7FC4 |
31696 rows × 6 columns
We created a network based on Amazon pet supply items as nodes; if a co-purchase between two items was made (found via the attribute "also_buy"), this becomes an edge. Some basic graph analysis was performed, looking at the average degree, mode, and other aspects. Here we also found the items (nodes) with the highest degree, because these are items typically bought with other items, and are of interest as recommendations. After this we looked at the modularity, which gave a high number, so community detection was the obvious next analysis. Here we found that the graph consists of one main component, plus a large number of very low-degree nodes or nodes with 0 degree. For the largest connected component, we again did some basic analysis, and we could then carry this community structure into an analysis of the reviews.
The reviews were converted to lowercase and tokenized. The tokens were created by excluding punctuation, URLs, mathematical symbols, and numbers. All tokens were stemmed and stopwords removed. All tokens were compiled into one comprehensive token list to identify some of the most common words across all reviews. For each review we computed a sentiment score, which was averaged per item to see whether there is a correlation between sentiment and other scores. We checked for possible bigrams to see if there is some context we might be missing that should be added. For each community we computed some of the most common words and the IDF of these words, to create a better understanding of what makes it unique and separates it from the others.
Some of the analysis we did was on the highest and lowest ratings and sentiment scores, to see whether any factor produced better and more liked items. We could not find a correlation, as the top items appeared basically identical to the bottom items in terms of plots and ratings. Another great way to understand some of the co-purchasing patterns was looking at word clouds for the different communities. This gave us an indication of the theme of the products in each community, and of which items are bought together by consumers.
In our network, nodes represent pet supply products, with an (undirected) link between node A and node B indicating that they have been purchased together before.
# Initialize an undirected graph
G = nx.Graph()
# Add nodes from 'asin' column
G.add_nodes_from(dfmeta['asin'])
# Explode the 'also_buy' column to create multiple rows
df_exploded = dfmeta[['asin', 'also_buy']].explode('also_buy')
df_edges = df_exploded.dropna()
# Filter df_edges based on whether values in 'also_buy' column are in dfmeta['asin']
# remove things outside category
df_edges = df_edges[df_edges['also_buy'].isin(dfmeta['asin'])]
edges = list(zip(df_edges['asin'], df_edges['also_buy']))  # pairwise (asin, also_buy) edges
G.add_edges_from(edges)
Here is some basic network analysis
num_nodes = len(G.nodes)
num_edges = len(G.edges)
print("Products:", num_nodes)
print("Products bought together (edges):", num_edges)
Products: 31696 Products bought together (edges): 205745
max_edges = num_nodes * (num_nodes - 1) / 2
print("maximum amount of links:", max_edges)
print("Density of network:", (num_edges/max_edges)*100)
maximum amount of links: 502302360.0 Density of network: 0.04096038887812512
We can see that the network is not dense at all: we have very few edges compared with the maximum possible number of edges.
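As a quick sanity check, networkx also provides a built-in density function; it returns a fraction rather than a percentage, so it should equal the number above divided by 100.
# Density = 2 * |E| / (|V| * (|V| - 1)) for an undirected graph
print("Density (networkx):", nx.density(G))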
print("Is the network connected?", nx.is_connected(G))
num_connected_components = nx.number_connected_components(G)
print("There are", num_connected_components, "connected components in the graph")
Is the network connected? False There are 10457 connected components in the graph
isolated_nodes = [node for node, degree in G.degree() if degree == 0]
print("There is:", len(isolated_nodes),"isolated nodes/products in the network")
There is: 10316 isolated nodes/products in the network
As we could see from the dataframe, there are some products that haven't been purchased with other products, so the network will not be connected. By browsing through the data we could also see that a large number of products have an empty also_buy list, and therefore the network is not dense, which aligns with our expectation.
degrees = [degree for node, degree in G.degree()]
avg_degree = np.mean(degrees)
median_degree = statistics.median(degrees)
degree_hist = nx.degree_histogram(G)
mode_degree = degree_hist.index(max(degree_hist))
min_degree = min(degrees)
max_degree = max(degrees)
print("Average:", avg_degree)
print("Median:", median_degree)
print("Mode:", mode_degree)
print("Minimum:", min_degree)
print("Maximum:", max_degree)
Average: 12.982395254921757 Median: 3.0 Mode: 0 Minimum: 0 Maximum: 1171
We can see from these numbers that the product most commonly bought with other items was bought with 1171 products, and there are also a lot of products that aren't bought with other products, which is indicated by the mode being 0. We can see from the median that the degree distribution is very skewed, so the average alone does not tell us much.
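To make the skew visible, here is a minimal sketch of the degree distribution on log-log axes (nodes with degree 0 are left out, since they cannot be shown on a log scale):
# Count how many nodes have each degree, then plot count vs. degree
degree_counts = Counter(degrees)
ks, counts = zip(*[(k, c) for k, c in sorted(degree_counts.items()) if k > 0])
plt.loglog(ks, counts, 'o', markersize=3)
plt.xlabel('Degree k')
plt.ylabel('Number of nodes')
plt.title('Degree distribution')
plt.show()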
degrees = dict(G.degree())
# Sort the nodes based on their degree in descending order
sorted_nodes = sorted(degrees, key=degrees.get, reverse=True)
# Get the top 5 nodes with the highest degree
top_5_nodes = sorted_nodes[:5]
print("Top 5 nodes with the highest degree:")
for node in top_5_nodes:
print("Node:", node, "Degree:", degrees[node])
Top 5 nodes with the highest degree: Node: B001HBBQKY Degree: 1171 Node: B0009X29WK Degree: 858 Node: B000255NCI Degree: 800 Node: B0002A5VK2 Degree: 714 Node: B0002563MW Degree: 696
for node in top_5_nodes:
node_attributes = dfmeta.loc[dfmeta['asin'] == node].squeeze()
print(node_attributes)
category [Pet Supplies, Dogs, Treats, Cookies, Biscuits...
description ['Wellness Just for Puppy Natural Dog Treats a...
title Wellness Soft Puppy Bites Natural Grain Free ...
also_buy []
price $2.99
asin B001HBBQKY
Name: 6741, dtype: object
category [Pet Supplies, Cats, Litter & Housebreaking, L...
description ['A clay litter uniquely formulated combining ...
title Dr. Elsey's Cat Ultra Premium Clumping Cat Lit...
also_buy []
price .a-box-inner{background-color:#fff}#alohaBuyBo...
asin B0009X29WK
Name: 2950, dtype: object
category [Pet Supplies, Fish & Aquatic Pets, Aquarium T...
description ['Most water problems are invisible to the eye...
title API Master Test Kits
also_buy []
price $14.95
asin B000255NCI
Name: 245, dtype: object
category [Pet Supplies, Fish & Aquatic Pets, Aquari...
description ['Seachem Purigen 100ml', '', '']
title Seachem Purigen for Freshwater & Saltwater
also_buy [B00029PO6O, B00BS96U60, B00B50UPE0, B00JE5W4Y...
price $8.22
asin B0002A5VK2
Name: 628, dtype: object
category [Pet Supplies, Fish & Aquatic Pets, Aquarium P...
description ['The Penn Plax Airline Tubing for Aquariums i...
title Penn Plax Airline Tubing for Aquariums –...
also_buy []
price $4.97
asin B0002563MW
Name: 294, dtype: object
We can see that 3 of the top 5 products are from Fish & Aquatic Pets (the same group), there is 1 for cats, and the most purchased is a dog treat.
As we want to analyse co-purchases, we look at the components in our graph, i.e. the nodes that have connections to others.
components = nx.connected_components(G)
component_sizes = [len(component) for component in components]
large_sorted_sizes = sorted(component_sizes, reverse=True)
# Select the sizes of the 25 largest components
largest_sizes = large_sorted_sizes[:25]
largest_sizes
[21049, 6, 4, 4, 4, 4, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]
A lot of the products are just bought as a pair or in smaller groups, with no connection to the larger graph.
components = nx.connected_components(G)
largest_component = max(components, key=len)
# Create a new graph containing only the nodes and edges of the largest component
largest_component_graph = G.subgraph(largest_component)
len(largest_component_graph.nodes())
21049
len(largest_component_graph.edges())
205548
degrees = [degree for node, degree in largest_component_graph.degree()]
avg_degree = np.mean(degrees)
median_degree = statistics.median(degrees)
degree_hist = nx.degree_histogram(largest_component_graph)
mode_degree = degree_hist.index(max(degree_hist))
min_degree = min(degrees)
max_degree = max(degrees)
print("Average:", avg_degree)
print("Median:", median_degree)
print("Mode:", mode_degree)
print("Minimum:", min_degree)
print("Maximum:", max_degree)
Average: 19.53042899900233 Median: 9 Mode: 1 Minimum: 1 Maximum: 1171
As the majority of the nodes are in the largest component, we want to analyse it specifically to uncover some of the co-purchasing patterns. In this graph we see that the minimum degree is now 1, and the median and average are much bigger since the 0-degree nodes are no longer part of it, but the distribution is still skewed.
def compute_modularity(graph, partitioning):
    # Modularity: Q = sum over communities c of [ L_c / L - (k_c / (2L))^2 ],
    # where L is the total number of edges, L_c the number of edges with both
    # endpoints inside community c, and k_c the total degree of c's nodes.
    L = graph.number_of_edges()
    modularity = 0
    for community_id in set(partitioning.values()):
        # Nodes in the current community
        nodes_in_community = [node for node, comm in partitioning.items() if comm == community_id]
        # Edges with both endpoints in the same community
        L_c = sum(1 for u, v in graph.edges(nodes_in_community) if partitioning[u] == partitioning[v])
        # Total degree of the community's nodes
        k_c = sum(graph.degree(node) for node in nodes_in_community)
        modularity += L_c / L - (k_c / (2 * L)) ** 2
    return modularity
partition = community.best_partition(largest_component_graph)
print("The amount of communities is:", len(set(partition.values())), "which is community",set(partition.values()) )
The amount of communities is: 45 which is community {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44}
items_modularity = compute_modularity(largest_component_graph,partition)
items_modularity
0.7829315576241409
As we have a really high modularity, we expect to see some clear communities, with similar items close to each other.
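As a sanity check of our own implementation, networkx's built-in modularity (which takes communities as a list of node sets rather than a node-to-community dict) should give the same value:
from networkx.algorithms.community import modularity as nx_modularity
# Convert the node -> community dict into an iterable of node sets
communities_as_sets = defaultdict(set)
for node, comm in partition.items():
    communities_as_sets[comm].add(node)
print(nx_modularity(largest_component_graph, communities_as_sets.values()))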
group_counts = defaultdict(int)
for node, group_id in partition.items():
group_counts[group_id] += 1
sorted_groups = sorted(group_counts.items(), key=lambda x: x[1], reverse=True)
# Print the count of nodes in each group
for group_id, count in sorted_groups:
print("Group", group_id, "has", count, "nodes")
Group 2 has 4402 nodes Group 3 has 3230 nodes Group 5 has 2741 nodes Group 16 has 2216 nodes Group 8 has 1923 nodes Group 34 has 1247 nodes Group 0 has 1060 nodes Group 37 has 960 nodes Group 18 has 711 nodes Group 15 has 611 nodes Group 6 has 470 nodes Group 30 has 337 nodes Group 20 has 239 nodes Group 27 has 168 nodes Group 23 has 109 nodes Group 35 has 109 nodes Group 7 has 92 nodes Group 19 has 80 nodes Group 21 has 71 nodes Group 43 has 31 nodes Group 28 has 27 nodes Group 24 has 21 nodes Group 29 has 20 nodes Group 22 has 18 nodes Group 11 has 18 nodes Group 12 has 15 nodes Group 14 has 14 nodes Group 41 has 12 nodes Group 26 has 12 nodes Group 44 has 11 nodes Group 36 has 10 nodes Group 40 has 8 nodes Group 33 has 6 nodes Group 10 has 6 nodes Group 4 has 5 nodes Group 39 has 5 nodes Group 31 has 5 nodes Group 42 has 4 nodes Group 38 has 4 nodes Group 9 has 4 nodes Group 13 has 4 nodes Group 17 has 4 nodes Group 25 has 3 nodes Group 32 has 3 nodes Group 1 has 3 nodes
dfmeta['group'] = dfmeta['asin'].map(partition)
for node in tqdm(largest_component_graph.nodes()):
    largest_component_graph.nodes[node]['group'] = dfmeta.loc[dfmeta['asin'] == node, 'group'].values[0]
100%|████████████████████████████████████| 21049/21049 [00:44<00:00, 470.88it/s]
netwulf.interactive.visualize(largest_component_graph)
plt.show()
assortativity_degree = nx.degree_assortativity_coefficient(largest_component_graph)
assortativity_degree
-0.06030633355263227
Degree assortativity measures the tendency of nodes with similar degrees to be connected to each other in a network. A positive coefficient indicates that nodes with similar degrees are more likely to be connected, while a negative coefficient suggests that nodes with different degrees tend to be connected. A score of about -0.060 is close to zero, as in a random network where connections are made without any preference based on node degree, so there is no significant tendency for nodes of similar degree to be connected.
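To illustrate what the coefficient measures, here is a minimal sketch (assuming scipy is available) that computes the Pearson correlation between the degrees at the two ends of every edge, counting each undirected edge in both directions; it should closely match the networkx value above:
from scipy.stats import pearsonr
deg = dict(largest_component_graph.degree())
# Degree pairs at the ends of each edge, in both orientations
pairs = [(deg[u], deg[v]) for u, v in largest_component_graph.edges()]
pairs += [(dv, du) for du, dv in pairs]
x, y = zip(*pairs)
print(pearsonr(x, y)[0])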
df_review
| asin | reviewText | overall | reviewTime | unixReviewTime | |
|---|---|---|---|---|---|
| 0 | 1440572828 | I was curious about making home cooked food to... | 5.0 | 2017-04-21 | 1492732800 |
| 1 | 1440572828 | Really good book | 5.0 | 2017-03-22 | 1490140800 |
| 2 | 1440572828 | Wish you had more recipes for treats | 3.0 | 2017-02-12 | 1486857600 |
| 3 | 1440572828 | Nice book. Can't wait to try these recipes | 5.0 | 2017-02-08 | 1486512000 |
| 4 | 1612231977 | I am disappointed in the quality of these. Th... | 1.0 | 2018-03-29 | 1522281600 |
| ... | ... | ... | ... | ... | ... |
| 497007 | B01HIQ9NGU | It did no harm, but hard to see any improvemen... | 4.0 | 2018-06-01 | 1527811200 |
| 497008 | B01HIV7FC4 | These are not rounded. I bought them for my li... | 4.0 | 2017-11-26 | 1511654400 |
| 497009 | B01HIV7FC4 | My destroyer French Bulldog was not able to de... | 5.0 | 2017-09-21 | 1505952000 |
| 497010 | B01HIV7FC4 | This is one of my dog's favorite toys, but all... | 4.0 | 2017-06-16 | 1497571200 |
| 497011 | B01HIV7FC4 | Best toy we've purchased for our new puppy. Ea... | 5.0 | 2017-05-04 | 1493856000 |
496672 rows × 5 columns
# Build the stopword set and stemmer once, instead of on every call
stop_words = set(stopwords.words('english'))
porter = PorterStemmer()

def remove_stopwords(tokens):
    return [token for token in tokens if token not in stop_words]

def tokenize_and_preprocess(text):
    if isinstance(text, str):
        # Tokenize the lowercased text
        tokens = word_tokenize(text.lower())
        # Keep only purely alphabetic tokens, which drops punctuation,
        # URLs, mathematical symbols, and numbers
        tokens = [token for token in tokens if re.match(r'^[a-zA-Z]+$', token)]
        tokens = remove_stopwords(tokens)
        # Apply stemming
        return [porter.stem(token) for token in tokens]
    else:
        return []
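For example, on a made-up review fragment the pipeline behaves roughly like this:
# Hypothetical input, just to illustrate the steps:
# lowercase -> drop non-alphabetic tokens -> drop stopwords -> stem
print(tokenize_and_preprocess("My dog LOVES these treats!!! 10/10 would buy again :)"))
# expected output (roughly): ['dog', 'love', 'treat', 'would', 'buy']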
df_review['review_tokens'] = df_review["reviewText"].apply(tokenize_and_preprocess)
comprehensive_tokens = list(chain.from_iterable(df_review['review_tokens']))
# Create a Counter object to count the frequency of each word
word_counts = Counter(comprehensive_tokens)
# Get the 10 most common words
most_common_words = word_counts.most_common(10)
most_common_words
[('dog', 229531),
('love', 159879),
('cat', 135637),
('like', 111007),
('use', 110351),
('one', 108985),
('great', 98700),
('get', 87993),
('work', 85084),
('would', 68607)]
Dog and cat are among the most mentioned words, and most of the common words are positive.
# initialize NLTK sentiment analyzer
analyzer = SentimentIntensityAnalyzer()
# create get_sentiment function
def get_sentiment(tokens):
text = ' '.join(tokens)
scores = analyzer.polarity_scores(text)
sentiment = scores['compound']
return sentiment
# apply get_sentiment function
df_review['sentiment'] = df_review['review_tokens'].apply(get_sentiment)
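One caveat, which we come back to in the discussion: we score the stemmed, stopword-free tokens, so negations such as "not" have already been removed and VADER's negation handling is lost. A minimal sketch of the alternative would score the raw review text instead ('sentiment_raw' is a hypothetical extra column, not used elsewhere in this notebook):
# Sketch: score the raw text so VADER can handle "not good" vs "good"
df_review['sentiment_raw'] = df_review['reviewText'].fillna('').apply(
    lambda text: analyzer.polarity_scores(text)['compound'])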
We then check whether the sentiment score correlates with the rating score.
# Plot overall rating against sentiment
plt.scatter(df_review['overall'], df_review['sentiment'], alpha=0.5)
plt.xlabel('Overall Rating')
plt.ylabel('Sentiment')
plt.title('Overall Rating vs Sentiment')
plt.show()
There are only 5 possible rating values, and because of the sheer size of the review dataset, almost every sentiment score occurs at each rating, so from this plot we can't tell anything. This is why we also average the ratings per product, to get more variation and perhaps see a correlation.
grouped_reviews = df_review.groupby('asin')
# Step 2: Aggregate 'reviewText' and 'overall' into lists
aggregated_reviews = grouped_reviews.agg({'review_tokens': list, 'overall': list, 'sentiment': list}).reset_index()
# Step 3: Calculate average rating for each product
aggregated_reviews['average_rating'] = aggregated_reviews['overall'].apply(lambda x: sum(x) / len(x))
aggregated_reviews['average_sentiment'] = aggregated_reviews['sentiment'].apply(lambda x: sum(x) / len(x))
# Step 4: Merge with your other DataFrame
merged_df = pd.merge(dfmeta, aggregated_reviews[['asin', 'review_tokens', 'average_rating','average_sentiment']], on='asin', how='left')
# Define a function to flatten the list of tokens
def flatten_tokens(tokens_list):
return list(chain.from_iterable(tokens_list))
# Apply the function to flatten the tokens in each row
merged_df['review_tokens'] = merged_df['review_tokens'].apply(flatten_tokens)
merged_df
| category | description | title | also_buy | price | asin | group | review_tokens | average_rating | average_sentiment | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | [Pet Supplies, Top Selection from AmazonPets] | ['Volume 1: 96 Words & Phrases! This is th... | Pet Media Feathered Phonics The Easy Way To Te... | [B0002FP328, B0002FP32S, B0002FP32I, B00CAMARX... | $6.97 | 0972585419 | 0.0 | [ok, much, time, word, bird, get, bore, good] | 2.000000 | 0.246000 |
| 1 | [Pet Supplies, Dogs, Health Supplies] | ["Our Dog Whisperer with Cesar Milan Complete ... | Dog Whisperer With Cesar Millan: Season 1 | [B000QXDFSA, B0018BD9DK, B002RJ8YDM, B002UJIY3... | NaN | 1417084871 | NaN | [thought, show, would, teach, whisper, dog, in... | 5.000000 | 0.541600 |
| 2 | [Pet Supplies, Dogs, Treats] | ['"You won\'t want to miss this one from Paris... | The Healthy Hound Cookbook: Over 125 Easy Reci... | [1617690554, 1449409938, 1604334657, 163220674... | $14.75 | 1440572828 | 34.0 | [curiou, make, home, cook, food, supplement, n... | 4.100000 | 0.543700 |
| 3 | [Pet Supplies, Dogs, Health Supplies, Hip &... | ['Dr. Rexy hemp oil has powerful anti-inflamma... | DR.REXY Hemp Oil for Dogs and Cats - 100% Orga... | [] | $19.90 | 1612231977 | NaN | [disappoint, qualiti, significantli, deterior,... | 4.734694 | 0.612665 |
| 4 | [Pet Supplies, Dogs] | ['At last! A comprehensive, holistic guide for... | Natural Cures for Your Dog & Cat | [] | $19.91 | 1882330919 | NaN | [great, inform] | 5.000000 | 0.624900 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 31691 | [Pet Supplies, Dogs, Collars, Harnesses & Leas... | ['Full Grip Supply Camo E-Bungee Collar is a r... | Full Grip Supply Camo E-Bungee Collar for Educ... | [B01KAX8QIO, B00W18D3F8, B005CXJ2OA, B005MJ65Z... | $16.99 | B01HIIJ4US | 2.0 | [awesom, littl, gadget, put, perfectli, time, ... | 4.500000 | 0.612425 |
| 31692 | [Pet Supplies, Cats, Flea & Tick Control, Flea... | ['Kills fleas, flea eggs, flea larvae and cont... | Sergeants Pet Care Prod 03282 Cat Flea/Tick Co... | [] | $3.34 | B01HIJGHOS | NaN | [got, new, puppi, ranch, rais, aka, flea, caus... | 2.250000 | 0.290050 |
| 31693 | [Pet Supplies, Cats, Flea & Tick Control, Flea... | ['Premium Quality Flea Comb for Dogs, Cats and... | #1 Pet Flea Comb For Dogs And Cats By Pet's Mu... | [B00JUQVR7U] | NaN | B01HIPJRBM | NaN | [long, hair, cat, love, never, like, brush, lo... | 4.500000 | 0.697500 |
| 31694 | [Pet Supplies, Dogs, Health Supplies, Suppleme... | ['Advita for Dogs is a blend of multiple probi... | VetOne Advita Probiotic Nutritional Supplement... | [B00DCV5E28, B078Y63641, B006CBD7LK, B077GHNQG... | $17.37 | B01HIQ9NGU | 8.0 | [good, probiot, mild, stress, coliti, dog, gre... | 4.428571 | 0.437914 |
| 31695 | [] | ['Latex Dog Toy Prepacks are creative combinat... | Zanies small latex dog toy with squeaker Pack ... | [] | $17.99 | B01HIV7FC4 | 2.0 | [round, bought, littl, dog, she, abl, pick, mo... | 4.500000 | 0.269575 |
31696 rows × 10 columns
top_5_sentiment_item = merged_df.nlargest(5, 'average_sentiment')
bottom_5_sentiment_item = merged_df.nsmallest(5, 'average_sentiment')
top_5_sentiment_item
| category | description | title | also_buy | price | asin | group | review_tokens | average_rating | average_sentiment | |
|---|---|---|---|---|---|---|---|---|---|---|
| 4287 | [Pet Supplies, Cats, Litter & Housebreaking, L... | ["Never touch cat litter again. The new and un... | CatGenie-Self Washing, Self Flushing Cat Box | [] | NaN | B000MKHQG4 | NaN | [purchas, catgeni, directli, manufactur, follo... | 1.0 | 0.9975 |
| 11607 | [Pet Supplies, Dogs, Collars, Harnesses & Leas... | ["Like to run hands-free with your dog? Want t... | OllyDog Mt Tam Hands-Free Dog Leash and Runnin... | [] | NaN | B005URI7DA | NaN | [month, use, still, entir, convinc, greatest, ... | 5.0 | 0.9963 |
| 18533 | [Pet Supplies, Fish & Aquatic Pets, Aquariums ... | ['Aqueon brings a new stage of aquatic health ... | Aqueon BettaBow LED Desktop Fish Aquarium Kit | [] | $43.99 | B00INCRQMW | NaN | [bought, two, amazon, sure, review, show, veri... | 5.0 | 0.9952 |
| 14449 | [Pet Supplies, Cats, Beds & Furniture, Cat Tre... | ["Kitty'scape - A Whole New Way for Cats to Pl... | Solvit Kittyscape Cat Tree House Extra Large C... | [] | $99.99 | B00B19BGE8 | NaN | [third, kittyscap, cat, tree, hous, bought, go... | 5.0 | 0.9943 |
| 8844 | [Pet Supplies, Dogs, Health Supplies, Suppleme... | ["A Natural Whole food Herbal Multi-Vitamin an... | Dr. Harvey's MultiVitamin Mineral & Herbal... | [] | $14.69 | B003NTTYLQ | 8.0 | [email, harvey, websit, dentist, way, get, ful... | 5.0 | 0.9938 |
top_5_sentiment_item["title"][8844]
"Dr. Harvey's MultiVitamin Mineral & Herbal Supplement"
bottom_5_sentiment_item
| category | description | title | also_buy | price | asin | group | review_tokens | average_rating | average_sentiment | |
|---|---|---|---|---|---|---|---|---|---|---|
| 21148 | [] | [] | Aquatic Arts 5 Live Freshwater Black Diamond S... | [B00CF0A7ZQ, B00HJEYUU6, B00GWMTT0C, B005CTKE4... | $23.95 | B00NUGP2FE | 16.0 | [one, shrimp, dead, arriv, anoth, one, look, e... | 3.0 | -0.9727 |
| 16369 | [Pet Supplies, Dogs, Treats, Bully Sticks] | ["Downtown Pet Supply Curly Bully Sticks are 1... | Downtown Pet Supply Best Free Range 10" T... | [] | $69.99 | B00DU23H5K | 5.0 | [horrif, experi, mini, goldendoodl, chew, one,... | 1.0 | -0.9654 |
| 13751 | [Pet Supplies, Cats, Cat Doors, Steps, Nets & ... | ['The Cat Mate Super Selective Chip and Disc c... | Pet Mate Cat Mate Elite Chip And Disc Supersel... | [] | $114.95 | B009GODTTK | NaN | [mine, two, annoy, failur, unit, lost, abil, r... | 2.0 | -0.9442 |
| 8292 | [Pet Supplies, Birds, Cages & Accessories, Bir... | ['HQ\'s Opening Dome Top Parrot cage is a perf... | HQ Open Dometop Birdcage with Stand | [B00BUEV9AU, B00GM49Y5U, B00LE594M0] | $148.91 | B002USI7YG | 0.0 | [lot, hesit, cage, final, satisfactori, conur,... | 3.0 | -0.9432 |
| 12620 | [Pet Supplies, Fish & Aquatic Pets, Aquarium W... | ['Excessive algae growth is the most common co... | Fritz Aquatics AFA48016 Algae Clean Out for Aq... | [] | $15.55 | B007GCE2W2 | NaN | [instanc, requir, repeat, use, work, use, tank... | 4.0 | -0.9201 |
From the top 5 and bottom 5 we can't really see any pattern for which type of product gets a high or low sentiment; 4 of the top 5 don't have a group. But we can see that the sentiment does not always represent the rating value: the first of the top 5 has a very high sentiment, yet the average rating for this product is only 1.
# Plot overall rating against sentiment
plt.scatter(merged_df['average_rating'], merged_df['average_sentiment'], alpha=0.5)
plt.xlabel('Average Overall Rating')
plt.ylabel('Average Sentiment')
plt.title('Overall Rating vs Sentiment')
plt.show()
We see a better correlation in this plot compared to the last one. There is a slight positive trend, meaning a higher rating also tends to come with a higher sentiment. We still see stripes at the 5 possible rating values; this is because some products have only one review, which makes that single rating the average rating.
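To put a number on the trend, a quick sketch (assuming scipy is available) computing the Pearson correlation between the two per-product averages:
from scipy.stats import pearsonr
# Drop products where either average is missing
valid = merged_df.dropna(subset=['average_rating', 'average_sentiment'])
r, p = pearsonr(valid['average_rating'], valid['average_sentiment'])
print(f"Pearson r = {r:.3f}, p = {p:.3g}")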
We see in the dataframe that some of the items don't have a price, so to see whether price has an impact we focus solely on products with a price, and drop the entries without one.
merged_df.dropna(subset=['price'], inplace=True)
# Strip the dollar sign (plain string replace, not a regex)
merged_df['price'] = merged_df['price'].str.replace('$', '', regex=False)
# Convert to float, replacing non-numeric values (e.g. leaked page markup) with NaN
merged_df['price'] = pd.to_numeric(merged_df['price'], errors='coerce')
# Plot overall rating against price
plt.scatter(merged_df['average_rating'], merged_df['price'], alpha=0.5)
plt.xlabel('Average Overall Rating')
plt.ylabel('Price')
plt.yscale('log')
plt.title('Overall Rating vs Price')
plt.show()
# Plot overall sentiment against price
plt.scatter( merged_df['average_sentiment'],merged_df['price'], alpha=0.5)
plt.xlabel('Average Sentiment')
plt.ylabel('Price')
plt.yscale('log')
plt.title('Sentiment vs Price')
plt.show()
From these two plots we observe that the price correlates with neither the rating nor the sentiment of the item.
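Again, a quick quantitative sketch to back up the visual impression (scipy assumed; rows without a price were already dropped above):
from scipy.stats import pearsonr
valid = merged_df.dropna(subset=['price', 'average_rating', 'average_sentiment'])
print("price vs rating:    r =", round(pearsonr(valid['price'], valid['average_rating'])[0], 3))
print("price vs sentiment: r =", round(pearsonr(valid['price'], valid['average_sentiment'])[0], 3))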
We want to check whether we are missing some contextual understanding and semantic information, so we check for bigrams.
token_bigrams = list(bigrams(comprehensive_tokens))  # avoid shadowing the imported name `bigrams`
def compute_contingency_tables(input_bigrams):
# Count occurrences of each bigram and its components
bigram_counter = Counter(input_bigrams)
word1_counter = Counter(word1 for word1, _ in input_bigrams)
word2_counter = Counter(word2 for _, word2 in input_bigrams)
# Compute contingency tables for each unique bigram
contingency_tables = {}
for bigram in tqdm(set(input_bigrams)):
word1, word2 = bigram
n_ii = bigram_counter[bigram]
n_io = word1_counter[word1] - n_ii
n_oi = word2_counter[word2] - n_ii
n_oo = len(input_bigrams) - n_ii - n_io - n_oi
contingency_tables[bigram] = {
'n_ii': n_ii,
'n_io': n_io,
'n_oi': n_oi,
'n_oo': n_oo
}
return contingency_tables
contingency_tables = compute_contingency_tables(token_bigrams)
100%|█████████████████████████████| 2173253/2173253 [00:07<00:00, 310377.49it/s]
def compute_expected_contingency_tables(contingency_tables):
    expected_contingency_tables = {}
    for bigram, contingency_table in tqdm(contingency_tables.items()):
        n_ii = contingency_table['n_ii']
        n_io = contingency_table['n_io']
        n_oi = contingency_table['n_oi']
        n_oo = contingency_table['n_oo']
        # Row and column totals of the 2x2 table
        R1 = n_ii + n_io
        R2 = n_oi + n_oo
        C1 = n_ii + n_oi
        C2 = n_io + n_oo
        # Grand total: each bigram occurrence counted exactly once
        # (note: R1 + C1 + R2 + C2 would double-count it)
        N = n_ii + n_io + n_oi + n_oo
        # Expected cell counts under independence: E_ij = (row_i * col_j) / N
        expected_table = {
            '(R1 * C1) / N': (R1 * C1) / N,
            '(R1 * C2) / N': (R1 * C2) / N,
            '(R2 * C1) / N': (R2 * C1) / N,
            '(R2 * C2) / N': (R2 * C2) / N}
        expected_contingency_tables[bigram] = expected_table
    return expected_contingency_tables
expected_contingency_tables = compute_expected_contingency_tables(contingency_tables)
100%|█████████████████████████████| 2173253/2173253 [00:06<00:00, 324126.37it/s]
def compute_chi_squared_statistics(observed_contingency_table, expected_contingency_table):
    chi_squared_statistics = {}
    for bigram, observed_values in tqdm(observed_contingency_table.items()):
        expected_values = expected_contingency_table[bigram]
        # Pearson's chi-squared: sum over cells of (O_ij - E_ij)^2 / E_ij;
        # both dicts store the four cells in the same insertion order
        chi_squared_statistics[bigram] = sum(
            (Oij - Eij) ** 2 / Eij
            for Oij, Eij in zip(observed_values.values(), expected_values.values()))
    return chi_squared_statistics
chi_squared = compute_chi_squared_statistics(contingency_tables,expected_contingency_tables)
100%|█████████████████████████████| 2173253/2173253 [00:13<00:00, 155523.50it/s]
def compute_p_value(chi_squared_statistics):
p_values = {}
for key, chi_squared_statistic in tqdm(chi_squared_statistics.items()):
p_value = chi2.sf(chi_squared_statistic, df=1)
p_values[key] = p_value
return p_values
p_values = compute_p_value(chi_squared)
100%|██████████████████████████████| 2173253/2173253 [02:13<00:00, 16248.45it/s]
# Find collocations: frequent bigrams (n_ii > 50) with a significant chi-squared
collocations = []
for bigram, appearance in tqdm(contingency_tables.items()):
    if appearance['n_ii'] > 50 and p_values.get(bigram, float('inf')) < 0.001:
        collocations.append((bigram, appearance['n_ii']))
100%|████████████████████████████| 2173253/2173253 [00:01<00:00, 1145357.45it/s]
len(collocations)
22693
collocations.sort(key=lambda x: x[1], reverse=True)
collocations[:20]
[(('dog', 'love'), 31576),
(('cat', 'love'), 17892),
(('work', 'great'), 15663),
(('work', 'well'), 13464),
(('well', 'made'), 8932),
(('highli', 'recommend'), 8708),
(('great', 'product'), 8502),
(('dog', 'food'), 8389),
(('litter', 'box'), 8338),
(('dog', 'like'), 8283),
(('year', 'old'), 7598),
(('cat', 'like'), 6318),
(('realli', 'like'), 6276),
(('good', 'qualiti'), 5947),
(('seem', 'like'), 5894),
(('small', 'dog'), 5856),
(('great', 'price'), 5177),
(('would', 'recommend'), 5071),
(('look', 'like'), 4895),
(('last', 'long'), 4876)]
When we look at the bigrams, it doesn't look like there is much contextual or semantic information that is not captured by the unigrams. There is no negation that changes the sentiment, or anything else that changes the meaning of the words. These are not all the bigrams, but we can comfortably say that the top 20 bigrams do not include a negating first word such as "not love".
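As a cross-check on our hand-rolled pipeline, NLTK ships a collocation finder that computes essentially the same chi-squared statistic; a minimal sketch using the same n_ii > 50 frequency threshold:
from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures
finder = BigramCollocationFinder.from_words(comprehensive_tokens)
finder.apply_freq_filter(51)  # keep bigrams occurring more than 50 times
print(finder.nbest(BigramAssocMeasures.chi_sq, 20))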
To get a better understanding of each group, we compute some of the most common words and their IDF for the 5 groups with the most members. Here IDF(w) = log(N / n_w), where N is the total number of products and n_w is the number of products whose review tokens contain w; a high IDF means the word is rare across all products and thus distinctive for the group.
group_counter = Counter(partition.values())
# Get the top 5 groups with largest number of items
top_5_groups = group_counter.most_common(5)
top_5_groups
[(2, 4402), (3, 3230), (5, 2741), (16, 2216), (8, 1923)]
groups = merged_df.groupby('group')
item_groups = {}
# Drop rows where 'group' is NaN
valid_groups = merged_df.dropna(subset=['group'])
# Iterate over each group
for i in tqdm(list(set(valid_groups["group"]))):
nest = set(tuple(tokens) for tokens in groups.get_group(i)["review_tokens"].tolist())
nested_lists = list(nest) # Convert set of tuples to a list of tuples
# Flatten the list of tuples into a single list using list comprehension
item_groups[i] = [item for sublist in nested_lists for item in sublist]
100%|███████████████████████████████████████████| 45/45 [00:01<00:00, 41.36it/s]
def calculate_idf(word_freq, dataframe):
# Total number of documents in the DataFrame
total_documents = len(dataframe)
# Count the number of documents containing each word
documents_containing_word = defaultdict(int)
for word, _ in word_freq:
for tokens in dataframe['review_tokens']:
if word in tokens:
documents_containing_word[word] += 1
# Calculate IDF for each word
idf_values = {}
for word, _ in word_freq:
if documents_containing_word[word] != 0:
idf_values[word] = math.log(total_documents / documents_containing_word[word])
else:
idf_values[word] = 0 # Handle the case where a word doesn't appear in any document
return idf_values
for group,_ in top_5_groups:
print("")
print(f"Group {group}")
most_common_ = Counter(item_groups[group]).most_common(10)
idf_ = calculate_idf(most_common_, merged_df)
for word, feq in most_common_:
        idf = idf_.get(word, 0)  # Get the IDF value for the word from the idf_ dictionary
print(f"Word: {word}, feq: {feq}, IDF: {idf}")
Group 2
Word: dog, feq: 77482, IDF: 0.6089354893950956
Word: love, feq: 38659, IDF: 0.4576897882944777
Word: one, feq: 28581, IDF: 0.5821226163716092
Word: great, feq: 27079, IDF: 0.5069164280454364
Word: use, feq: 23761, IDF: 0.5781915896146571
Word: like, feq: 23597, IDF: 0.5195414570013981
Word: toy, feq: 23593, IDF: 1.8884657402044895
Word: get, feq: 21089, IDF: 0.6331186897731702
Word: work, feq: 17730, IDF: 0.713956792023889
Word: well, feq: 17471, IDF: 0.69614119477255

Group 3
Word: cat, feq: 75739, IDF: 1.3086691990893544
Word: love, feq: 28619, IDF: 0.4576897882944777
Word: like, feq: 21115, IDF: 0.5195414570013981
Word: one, feq: 20710, IDF: 0.5821226163716092
Word: litter, feq: 18233, IDF: 2.9433185308699508
Word: use, feq: 18149, IDF: 0.5781915896146571
Word: food, feq: 16496, IDF: 1.5515231735597126
Word: get, feq: 15395, IDF: 0.6331186897731702
Word: great, feq: 12350, IDF: 0.5069164280454364
Word: box, feq: 11937, IDF: 1.9633761958764524

Group 5
Word: dog, feq: 39088, IDF: 0.6089354893950956
Word: love, feq: 24446, IDF: 0.4576897882944777
Word: food, feq: 18655, IDF: 1.5515231735597126
Word: like, feq: 13248, IDF: 0.5195414570013981
Word: treat, feq: 13215, IDF: 1.7148261093696961
Word: one, feq: 8405, IDF: 0.5821226163716092
Word: eat, feq: 7803, IDF: 1.502397194756801
Word: cat, feq: 7354, IDF: 1.3086691990893544
Word: good, feq: 6927, IDF: 0.6504733889691939
Word: get, feq: 6650, IDF: 0.6331186897731702

Group 16
Word: tank, feq: 14028, IDF: 2.3010914407372622
Word: use, feq: 7459, IDF: 0.5781915896146571
Word: work, feq: 7449, IDF: 0.713956792023889
Word: water, feq: 7430, IDF: 1.6344108333654803
Word: fish, feq: 7338, IDF: 2.299434439529633
Word: great, feq: 6730, IDF: 0.5069164280454364
Word: filter, feq: 6565, IDF: 3.040048157328502
Word: one, feq: 5439, IDF: 0.5821226163716092
Word: like, feq: 4491, IDF: 0.5195414570013981
Word: good, feq: 4142, IDF: 0.6504733889691939

Group 8
Word: dog, feq: 25476, IDF: 0.6089354893950956
Word: use, feq: 11321, IDF: 0.5781915896146571
Word: work, feq: 10772, IDF: 0.713956792023889
Word: product, feq: 9336, IDF: 0.7319228839588409
Word: like, feq: 7161, IDF: 0.5195414570013981
Word: get, feq: 6778, IDF: 0.6331186897731702
Word: love, feq: 6456, IDF: 0.4576897882944777
Word: great, feq: 6264, IDF: 0.5069164280454364
Word: one, feq: 5743, IDF: 0.5821226163716092
Word: flea, feq: 5549, IDF: 3.7436666377557426
The animal isn't the main factor dividing the category; it's factors like the use of the product, as we can see in group 16, where "filter" and "tank" have high IDF, and in group 5, where "food" and "treat" stand out. So even with high-frequency words, we get a much better understanding of a group's items when we look at the IDF.
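As a small follow-up sketch, the same top words can be re-ranked by TF-IDF (frequency times IDF), which should move distinctive words like "litter" and "filter" up the ranking:
for group, _ in top_5_groups:
    counts = Counter(item_groups[group]).most_common(10)
    idf_values = calculate_idf(counts, merged_df)
    # Score each word by count * IDF and sort descending
    ranked = sorted(((w, c * idf_values.get(w, 0)) for w, c in counts),
                    key=lambda pair: pair[1], reverse=True)
    print(f"Group {group}:", [w for w, _ in ranked[:5]])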
degrees = dict(G.degree())
degrees = sorted(degrees.items(), key=lambda x: x[1], reverse=True)
top_1000_nodes = [node for node, deg in degrees[:1000]]
bottom_1000_nodes = [node for node, deg in degrees[-1000:]]
# Filter merged_df based on the top 1000 and bottom 1000 nodes
top_1000_df = merged_df[merged_df['asin'].isin(top_1000_nodes)]
bottom_1000_df = merged_df[merged_df['asin'].isin(bottom_1000_nodes)]
avg_average_rating = top_1000_df['average_rating'].mean()
avg_average_sentiment = top_1000_df['average_sentiment'].mean()
print("Average of average_rating in top_1000:", avg_average_rating)
print("Average of average_sentiment in top_1000:", avg_average_sentiment)
avg_average_rating = bottom_1000_df['average_rating'].mean()
avg_average_sentiment = bottom_1000_df['average_sentiment'].mean()
print("Average of average_rating in bottom_1000:", avg_average_rating)
print("Average of average_sentiment in bottom_1000:", avg_average_sentiment)
Average of average_rating in top_1000: 4.283205006623296 Average of average_sentiment in top_1000: 0.4726882623015374 Average of average_rating in bottom_1000: 4.104591659289994 Average of average_sentiment in bottom_1000: 0.5096076121207322
The node degree doesn't seem to have an effect on sentiment or rating, as the averages for the top and bottom 1000 are almost the same.
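As a complementary sketch (scipy assumed), a rank correlation between degree and average rating over all products, rather than only the two extreme buckets:
from scipy.stats import spearmanr
deg_map = dict(G.degree())
valid = merged_df.dropna(subset=['average_rating']).copy()
valid['degree'] = valid['asin'].map(deg_map)
print(spearmanr(valid['degree'], valid['average_rating']))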
sorted_df = merged_df.sort_values(by='price')
expensive_1000_items = sorted_df.tail(1000)
cheap_1000_items = sorted_df.head(1000)
avg_average_rating = expensive_1000_items['average_rating'].mean()
avg_average_sentiment = expensive_1000_items['average_sentiment'].mean()
print("Average of average_rating in most expensive 1000:", avg_average_rating)
print("Average of average_sentiment in most expensive 1000:", avg_average_sentiment)
avg_average_rating = cheap_1000_items['average_rating'].mean()
avg_average_sentiment = cheap_1000_items['average_sentiment'].mean()
print("Average of average_rating in cheapest 1000:", avg_average_rating)
print("Average of average_sentiment in cheapest 1000:", avg_average_sentiment)
Average of average_rating in most expensive 1000: 4.189955511311651 Average of average_sentiment in most expensive 1000: 0.5130592409297675 Average of average_rating in cheapest 1000: 4.112984430621785 Average of average_sentiment in cheapest 1000: 0.47387498297615827
We computed the same using price and found the same conclusion: higher-priced products do not tend to receive more positive or negative feedback compared to lower-priced items.
rating_groups = {}
sentiment_groups = {}
for group_id, group_df in merged_df.groupby('group'):
rating_groups[group_id] = group_df['average_rating'].mean()
sentiment_groups[group_id] = group_df['average_sentiment'].mean()
# Sort the dictionary based on values
sorted_dict = dict(sorted(sentiment_groups.items(), key=lambda item: item[1]))
# Print the 5 groups with the lowest average sentiment
print("Bottom 5:")
for key, value in list(sorted_dict.items())[:5]:
print("the 5 most common words for group",key,Counter(item_groups[key]).most_common(5))
print(key, ":", value)
# Print the 5 groups with the highest average sentiment
print("\nTop 5:")
for key, value in list(sorted_dict.items())[-5:]:
print("the 5 most common words for group",key,Counter(item_groups[key]).most_common(5))
print(key, ":", value)
Bottom 5:
the 5 most common words for group 22.0 [('dog', 101), ('work', 90), ('spot', 45), ('grass', 44), ('use', 33)]
22.0 : 0.30083530636892436
the 5 most common words for group 41.0 [('bag', 82), ('dog', 31), ('use', 22), ('good', 21), ('like', 20)]
41.0 : 0.3063075255637271
the 5 most common words for group 20.0 [('dog', 2581), ('work', 1929), ('collar', 1815), ('use', 1198), ('bark', 1033)]
20.0 : 0.3488743727609684
the 5 most common words for group 14.0 [('cat', 41), ('claw', 14), ('nail', 14), ('get', 11), ('cap', 11)]
14.0 : 0.3560346153846154
the 5 most common words for group 44.0 [('dog', 26), ('fenc', 23), ('work', 21), ('one', 18), ('bumper', 17)]
44.0 : 0.37356504385964906
Top 5:
the 5 most common words for group 25.0 [('dog', 3), ('love', 2), ('collar', 2), ('last', 2), ('look', 1)]
25.0 : 0.6475500000000001
the 5 most common words for group 35.0 [('collar', 328), ('leash', 212), ('dog', 197), ('love', 161), ('lupin', 154)]
35.0 : 0.6542897007846774
the 5 most common words for group 1.0 [('bag', 10), ('food', 7), ('great', 6), ('dog', 5), ('perfect', 3)]
1.0 : 0.6761033333333334
the 5 most common words for group 17.0 [('cat', 66), ('bed', 49), ('love', 27), ('like', 20), ('one', 15)]
17.0 : 0.7922762195121952
the 5 most common words for group 10.0 [('toy', 5), ('one', 3), ('beak', 2), ('eyebal', 2), ('come', 2)]
10.0 : 0.8119000000000001
# Sort the dictionary based on values
sorted_dict = dict(sorted(rating_groups.items(), key=lambda item: item[1]))
# Print the 5 groups with the lowest average rating
print("Bottom 5:")
for key, value in list(sorted_dict.items())[:5]:
print("the 5 most common words for group",key,Counter(item_groups[key]).most_common(5))
print(key, ":", value)
# Print the 5 groups with the highest average rating
print("\nTop 5:")
for key, value in list(sorted_dict.items())[-5:]:
print("the 5 most common words for group",key,Counter(item_groups[key]).most_common(5))
print(key, ":", value)
Bottom 5:
the 5 most common words for group 22.0 [('dog', 101), ('work', 90), ('spot', 45), ('grass', 44), ('use', 33)]
22.0 : 2.8783080790844764
the 5 most common words for group 4.0 [('help', 2), ('ice', 2), ('walk', 2), ('warm', 2), ('worth', 2)]
4.0 : 3.0
the 5 most common words for group 32.0 [('collar', 9), ('bright', 8), ('dog', 6), ('light', 5), ('batteri', 4)]
32.0 : 3.4711538461538463
the 5 most common words for group 39.0 [('dog', 11), ('toy', 9), ('love', 5), ('chewer', 5), ('product', 5)]
39.0 : 3.525
the 5 most common words for group 14.0 [('cat', 41), ('claw', 14), ('nail', 14), ('get', 11), ('cap', 11)]
14.0 : 3.5679487179487177
Top 5:
the 5 most common words for group 35.0 [('collar', 328), ('leash', 212), ('dog', 197), ('love', 161), ('lupin', 154)]
35.0 : 4.825128272558179
the 5 most common words for group 17.0 [('cat', 66), ('bed', 49), ('love', 27), ('like', 20), ('one', 15)]
17.0 : 4.926829268292683
the 5 most common words for group 1.0 [('bag', 10), ('food', 7), ('great', 6), ('dog', 5), ('perfect', 3)]
1.0 : 5.0
the 5 most common words for group 10.0 [('toy', 5), ('one', 3), ('beak', 2), ('eyebal', 2), ('come', 2)]
10.0 : 5.0
the 5 most common words for group 25.0 [('dog', 3), ('love', 2), ('collar', 2), ('last', 2), ('look', 1)]
25.0 : 5.0
By observing the top and bottom groups and their most common words, we can't see any trend between the type of product and the sentiment or rating; many of the top groups involve dogs, but so do many of the bottom groups.
nan_group_df = merged_df[merged_df['group'].isna()]
nan_group_df['average_rating'].mean()
4.102630306984782
nan_group_df['average_sentiment'].mean()
0.4927189153512092
We checked whether the items not in a group have abnormal rating or sentiment values, but they have values similar to the other groups. So whether or not an item is in the largest component has no impact on the sentiment or rating of the product.
To get a clearer idea of what the communities are based on, we plot the word clouds of the 9 communities with the most members.
top_9_groups = group_counter.most_common(9)
# Extract top 9 group ids
top_9_group_ids = [group[0] for group in top_9_groups]
# Filter merged_df to include only the top 9 groups
df_filtered_top_9 = merged_df[merged_df['group'].isin(top_9_group_ids)]
# Create a 3x3 grid of subplots
fig, axes = plt.subplots(3, 3, figsize=(15, 15))
# Iterate over each group and corresponding axis
for (group_id, group_df), ax in zip(df_filtered_top_9.groupby('group'), axes.flatten()):
# Flatten and concatenate tokens for the group
group_tokens = ' '.join(chain.from_iterable(group_df['review_tokens']))
# Generate word cloud for the group
wordcloud = WordCloud(width=400, height=400, background_color='white').generate(group_tokens)
# Display the word cloud on the corresponding subplot
ax.imshow(wordcloud, interpolation='bilinear')
ax.set_title('Word Cloud for Group {}'.format(int(group_id)))
ax.axis('off')
# Adjust layout
plt.tight_layout()
plt.show()
In the communities with the most members we see that many are about dogs, but within this there are different categories, such as one community about grooming and another about dog treats/food. Because we are using the reviews, some words are not that informative, e.g. "like" and "love"; they don't help us much in finding the category of the community, but instead carry sentiment.
Reflecting on our work, we had some wins and some spots where we could do better. One good thing is that we managed to find important insights: the main questions we wanted answered could be concluded to some extent. There are of course always more ways to filter and look at the data, but as it stands we got a clear understanding of what influences the co-purchase patterns.
However, we spotted some gaps we need to fill. One is in the sentiment analysis: we noticed that sometimes the sentiment didn't match the ratings. This happened because we didn't handle phrases like "not good" or "didn't destroy" properly, so some comments were misunderstood, which might have skewed our analysis. Fixing this would improve our understanding of customer sentiment and make our findings more accurate.
Also, it's important to use all the available data. While we worked with what our computers could handle, including all the data could give us deeper insights and a fuller picture. That way, our conclusions would be based on a better understanding of the data, making the analysis stronger.
Looking ahead, there are steps we can take to make the analysis even better. For example, focusing on higher-core reviews, i.e. products with more than ten reviews, would help balance out unusual reviews and give a more accurate picture. Also, the dataset has a 2023 version; using newer data would keep our findings up to date and relevant.
We found that certain product categories appear in both the top and bottom sentiment rankings. Combining these similar categories into one group could provide a clearer understanding of their influence on sentiment and ratings.
To sum up, our analysis gave us useful insights, but there's room for improvement. By fixing these areas and making the suggested changes, we can make our analysis stronger and learn more about co-purchase patterns and related topics.
In our investigation, we aimed to uncover the factors influencing co-purchasing habits among pet supply customers and how these factors impact their product reviews. Additionally, we sought to determine whether price influences these reviews.
Through our analysis, we discovered that customers tend to co-purchase items primarily within the same animal species and product category, such as food or grooming products. While certain communities exhibit a more positive sentiment, some communities within the same category also appear in the less favorable rankings. Dogs are prevalent in both the top and bottom five groups, making it challenging to conclude whether a specific animal or category consistently leads to better product reviews. Additionally, our findings revealed that price does not significantly affect reviews.